The purpose of this markdown is to explore the relationship between TOC and snowmelt signals within the CLP, with the intention of creating a baseline forecast model to measure further model development against.

Exploring DOC/TOC interchangeability

Let’s look at the relationship between TOC and DOC: if the two track closely, we can use DOC “as” TOC and effectively gain “more” data.

chem %>% 
  filter(!is.na(TOC) & !is.na(DOC)) %>% 
  ggplot(., aes(x = TOC, y = DOC)) + 
  geom_point() + 
  facet_wrap(Site ~ .)

The answer is a qualified yes; the main exception is essentially everything downstream of the Canyon Mouth. Let’s also look at water source type:

chem %>% 
  filter(!is.na(TOC) & !is.na(DOC)) %>% 
  left_join(., locations) %>% 
  ggplot(., aes(x = TOC, y = DOC)) + 
  geom_point() + 
  facet_wrap(location_type ~ .)

Let’s look at mainstem here:

chem %>% 
  filter(!is.na(TOC) & !is.na(DOC) & location_type == "Mainstem") %>% 
  left_join(., locations) %>% 
  ggplot(., aes(x = TOC, y = DOC)) + 
  geom_point() + 
  facet_wrap(Campaign ~ .)

Okay, this aligns with the relationship falling apart after the Canyon Mouth.

chem %>% 
  filter(!is.na(TOC) & !is.na(DOC) & grepl("Upper", Campaign)) %>% 
  left_join(., locations) %>% 
  ggplot(., aes(x = TOC, y = DOC)) + geom_point() + 
  facet_wrap(Site ~ .)

Let’s also look at the South Fork:

chem %>% 
  filter(!is.na(TOC) & !is.na(DOC) & Campaign == "South Fork") %>% 
  left_join(., locations) %>% 
  ggplot(., aes(x = TOC, y = DOC)) + geom_point() + 
  facet_wrap(Site ~ .)

Great, so let’s look at DOC across the upper sites and South Fork:

chem %>% 
  filter(grepl("Upper", Campaign) | Campaign == "South Fork") %>% 
  mutate(dist_name = paste(distance_upstream_km, Site, sep = " - ")) %>% 
  ggplot(., aes(x = yday(Date), y = DOC, color = year(Date))) + 
  geom_point() + 
  facet_wrap(dist_name ~.)

Okay, these are all right on: the correlation is roughly 0.97, the intercept is near zero, and the slope is near 1, so for these sites we’ll treat DOC = TOC.
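As a quick numeric check on that claim (a sketch, reusing the same chem columns as the chunks above; the filter mirrors the upper-sites/South Fork plots):

```r
# pair up the matched TOC/DOC observations for the sites in question
paired <- chem %>% 
  filter(!is.na(TOC) & !is.na(DOC) & 
           (grepl("Upper", Campaign) | Campaign == "South Fork"))

cor(paired$TOC, paired$DOC)          # expect ~0.97
coef(lm(DOC ~ TOC, data = paired))   # expect intercept ~0, slope ~1
```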

And we’ll remove post-fire 2021 and 2022, since we’ll assume DOC was not as dominant a part of TOC during that time, or at the very least made up a different proportion of TOC during the recovery period. (We only have DOC-TOC matches post-fire.)

chem %>% 
  filter(!year(Date) %in% c(2021, 2022) & grepl("Upper|South", Campaign)) %>% 
  mutate(dist_name = paste(distance_upstream_km, Site, sep = " - "),
         ordered_name = factor(Site, levels = c("Poudre at Bellvue Diversion",
                                                "Poudre Above North Fork",
                                                "Poudre at Manner's Bridge",
                                                "Poudre Below S. Fork",
                                                "Poudre Below Rustic",
                                                "Little Beaver",
                                                "Pennock (Signal Mountain)",
                                                "Poudre below Poudre Falls",
                                                "Poudre Above Joe Wright",
                                                "Sawmill"))) %>%
  ggplot(., aes(x = yday(Date), y = DOC, color = year(Date))) + 
  geom_point() + 
  facet_wrap(Site + distance_upstream_km + Campaign ~.)

The consistency between the 2015 and post-fire data suggests the DOC-TOC relationship holds across both periods: post-fire DOC doesn’t look systematically larger or smaller than pre-fire.

For now, let’s limit this to the Mainstem - Upper, Upper Tributaries, and South Fork campaigns, and the Mainstem, Stream, and Inflow location types:

chem %>% 
  filter(grepl("Upper|South", Campaign) & 
           location_type %in% c("Mainstem", "Stream", "Inflow") &
           !year(Date) %in% c(2021, 2022)) %>% 
  mutate(dist_name = paste(distance_upstream_km, Site, sep = " - "),
         ordered_name = factor(Site, levels = c("Poudre at Bellvue Diversion",
                                                "Poudre Above North Fork",
                                                "Poudre at Manner's Bridge",
                                                "Poudre Below S. Fork", 
                                                "Poudre Below Rustic",
                                                "Little Beaver",
                                                "Poudre South Fork",
                                                "Pennock (Signal Mountain)",
                                                "Hourglass Inflow",
                                                "Beaver",
                                                "Comanche Inflow",
                                                "Poudre at Sleeping Elephant",
                                                "Poudre below Poudre Falls",
                                                "Poudre Above Joe Wright",
                                                "Sawmill"))) %>% 
  ggplot(., aes(x = yday(Date), y = DOC, color = year(Date))) + 
  geom_point() + 
  facet_wrap(ordered_name + distance_upstream_km ~.)

GAM time

Great, now let’s play with a GAM fit. For now, we won’t do a TVT (train-validate-test) split, because I just want to see whether it works. I have played with including the Upper Tribs, but they don’t add much value here, so I’m leaving them out.

mainstem <- chem %>% 
  filter(Campaign %in% c("Mainstem - Upper", "South Fork") & 
           location_type %in% c("Mainstem", "Stream", "Inflow")) %>% 
  mutate(DOY = yday(Date),
         year = year(Date), 
         toc_doc = if_else(!is.na(TOC), TOC, DOC),
         # categorical variables to play with; they didn't pan out
         identity = if_else(Campaign == "South Fork", 1, 0),
         fire_signal = if_else(year %in% c(2021, 2022), 1, 0)) %>% 
  # limit time, remove fire signal, remove flier
  filter(!is.na(toc_doc), between(DOY, 100, 320), fire_signal == 0, toc_doc < 9)

# see what this looks like
TOC_GAM <- gam(toc_doc ~ te(DOY, distance_upstream_km, k = c(6, 6)),
               data = mainstem)

gam.check(TOC_GAM)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 7.782563e-06 .
## The Hessian was positive definite.
## Model rank =  36 / 36 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value
## te(DOY,distance_upstream_km) 35.0 28.1    1.01    0.56
plot(TOC_GAM, scheme = 2, too.far = 0.1, shade = TRUE)

vis.gam(TOC_GAM, view = c("DOY", "distance_upstream_km"),
        plot.type = "persp", theta = 30, phi = 30,
        ticktype = "detailed")

Okay, this seems to work pretty well. We do, however, want to do some thoughtful TVT here to keep our error estimates honest. First, let’s look at the annual sample distribution, since we want to test on 2025.

mainstem %>% 
  summarize(n = n(), 
            .by = year)
## # A tibble: 5 × 2
##    year     n
##   <dbl> <int>
## 1  2014     5
## 2  2015    32
## 3  2023   101
## 4  2024   180
## 5  2025    57

Maybe we actually need to split by South Fork and Mainstem so that the huge 2024 contribution, and the minimal years (i.e., all the others), don’t distort any single fold.

mainstem %>% 
  summarise(n = n(),
            .by = c(year, Campaign))
## # A tibble: 10 × 3
##     year Campaign             n
##    <dbl> <chr>            <int>
##  1  2014 Mainstem - Upper     4
##  2  2014 South Fork           1
##  3  2015 Mainstem - Upper    24
##  4  2015 South Fork           8
##  5  2023 South Fork          39
##  6  2023 Mainstem - Upper    62
##  7  2024 Mainstem - Upper   126
##  8  2024 South Fork          54
##  9  2025 Mainstem - Upper    46
## 10  2025 South Fork          11

In order to get good cross-validation and mix South Fork/Mainstem dynamics into the TVT, here’s the plan: hold out 2025 as the final test set, fit five GAMs that each validate on a different year-campaign slice (all of 2015, then 2023 and 2024 each split into Mainstem and South Fork), and average the five fits as an ensemble on the 2025 test set.

GAM diagnostics are interpreted using the gam.check() function. We want to make sure that:

  1. the model converged (second line of the output), the convergence value is very small, and ideally the model rank is full (e.g., 36/36, as above)
  2. we are confident we did not land in a local minimum (“The Hessian was positive definite”)
  3. the k-index is ~1 and the p-value is > 0.05
  4. we want the lowest k (here the default is 25, or c(5, 5)) that doesn’t max out the wiggliness, which we judge by comparing k’ (the maximum number of wiggles) against edf (the wiggles actually used)

Some notes on interpretation and tuning:

My rule of thumb for avoiding overfitting: if your model’s edf is less than the k’ of a smaller k, you may be over-parameterized and should try the smaller k. For example:

With edf = 20 and k’ = 48 (implying k = c(7, 7), since 7 x 7 - 1 = 48), you’d want to try k = c(6, 6), which has a k’ of 35 (6 x 6 - 1); ideally the edf stays low in the smaller parameterization space.
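That arithmetic is easy to sanity-check in code (for a te() smooth, k’ is the product of the marginal k values minus one for the identifiability constraint, matching the 7x7 - 1 logic above):

```r
# k' for a te() tensor smooth: prod(k) minus the centering constraint
kprime <- function(k) prod(k) - 1
kprime(c(7, 7))  # 48
kprime(c(6, 6))  # 35
kprime(c(5, 5))  # 24 (the c(5, 5) default noted above)
```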

CV Model 1

valid_1 <- mainstem %>% filter(year == 2015)
train_1 <- anti_join(mainstem, valid_1) %>% 
  filter(year != 2025)

# fit gam
fit_1 <- gam(toc_doc ~ te(DOY, distance_upstream_km, k = c(6, 6)),
             data = train_1, method = "GCV.Cp")

# check gam diagnostics
gam.check(fit_1)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 1.888723e-07 .
## The Hessian was positive definite.
## Model rank =  36 / 36 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value
## te(DOY,distance_upstream_km) 35.0 20.7    0.96     0.2
# visualize 2d/3d
vis.gam(fit_1,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "contour",
        color = "topo",
        too.far = 0.1)

vis.gam(fit_1,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "persp",
        theta = 30, phi = 30,
        ticktype = "detailed")

# apply to validation
valid_1$pred <- predict(fit_1, newdata = valid_1)

# and report performance
mae(valid_1$toc_doc, valid_1$pred)
## [1] 0.5743317
rmse(valid_1$toc_doc, valid_1$pred)
## [1] 0.7427479
bias(valid_1$toc_doc, valid_1$pred)
## [1] 0.4286212
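The mae(), rmse(), and bias() helpers used here are never defined in this document; I’m assuming they follow the (observed, predicted) call signature of the Metrics package. Hand-rolled equivalents would look roughly like:

```r
# hand-rolled metric helpers, (observed, predicted) argument order as above
mae_fn  <- function(obs, pred) mean(abs(obs - pred))
rmse_fn <- function(obs, pred) sqrt(mean((obs - pred)^2))
bias_fn <- function(obs, pred) mean(obs - pred)  # sign convention varies by package
```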

CV Model 2

valid_2 <- mainstem %>% filter(year == 2023 & Campaign == "Mainstem - Upper")
train_2 <- anti_join(mainstem, valid_2) %>% 
  filter(year != 2025)

# fit gam
fit_2 <- gam(toc_doc ~ te(DOY, distance_upstream_km, k = c(6, 6)),
             data = train_2, method = "GCV.Cp")

# check gam diagnostics
gam.check(fit_2)  

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 4.447032e-06 .
## The Hessian was positive definite.
## Model rank =  36 / 36 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value
## te(DOY,distance_upstream_km) 35.0 31.8    0.96    0.17
# visualize 2d/3d
vis.gam(fit_2,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "contour",
        color = "topo",
        too.far = 0.1)

vis.gam(fit_2,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "persp",
        theta = 30, phi = 30,
        ticktype = "detailed")

# apply to validation
valid_2$pred <- predict(fit_2, newdata = valid_2)

# and report performance
mae(valid_2$toc_doc, valid_2$pred)
## [1] 0.6436643
rmse(valid_2$toc_doc, valid_2$pred)
## [1] 0.803653
bias(valid_2$toc_doc, valid_2$pred)
## [1] 0.3206894

CV Model 3

valid_3 <- mainstem %>% filter(year == 2023 & Campaign == "South Fork")
train_3 <- anti_join(mainstem, valid_3) %>% 
  filter(year != 2025)

# fit gam
fit_3 <- gam(toc_doc ~ te(DOY, distance_upstream_km, k = c(6, 6)),
             data = train_3, method = "GCV.Cp")

# check gam diagnostics
gam.check(fit_3) 

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.970693e-07 .
## The Hessian was positive definite.
## Model rank =  36 / 36 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value
## te(DOY,distance_upstream_km) 35.0 26.4    0.95    0.17
# visualize 2d/3d
vis.gam(fit_3,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "contour",
        color = "topo",
        too.far = 0.1)

vis.gam(fit_3,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "persp",
        theta = 30, phi = 30,
        ticktype = "detailed")

# apply to validation
valid_3$pred <- predict(fit_3, newdata = valid_3)

# and report performance
mae(valid_3$toc_doc, valid_3$pred)
## [1] 0.721934
rmse(valid_3$toc_doc, valid_3$pred)
## [1] 0.8513111
bias(valid_3$toc_doc, valid_3$pred)
## [1] 0.4633756

CV Model 4

valid_4 <- mainstem %>% filter(year == 2024 & Campaign == "Mainstem - Upper")
train_4 <- anti_join(mainstem, valid_4) %>% 
  filter(year != 2025)

# fit gam
fit_4 <- gam(toc_doc ~ te(DOY, distance_upstream_km, k = c(6, 6)),
             data = train_4, method = "GCV.Cp")

# check gam diagnostics
gam.check(fit_4)  

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 1.100435e-06 .
## The Hessian was positive definite.
## Model rank =  36 / 36 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value
## te(DOY,distance_upstream_km) 35.0 27.4    1.14       1
# visualize 2d/3d
vis.gam(fit_4,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "contour",
        color = "topo",
        too.far = 0.1)

vis.gam(fit_4,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "persp",
        theta = 30, phi = 30,
        ticktype = "detailed")

# apply to validation
valid_4$pred <- predict(fit_4, newdata = valid_4)

# and report performance
mae(valid_4$toc_doc, valid_4$pred)
## [1] 0.585372
rmse(valid_4$toc_doc, valid_4$pred)
## [1] 0.7030657
bias(valid_4$toc_doc, valid_4$pred)
## [1] -0.4378622

CV Model 5

valid_5 <- mainstem %>% filter(year == 2024 & Campaign == "South Fork")
train_5 <- anti_join(mainstem, valid_5) %>% 
  filter(year != 2025)

# fit gam
fit_5 <- gam(toc_doc ~ te(DOY, distance_upstream_km, k = c(6, 6)),
             data = train_5, method = "GCV.Cp")

# check gam diagnostics
gam.check(fit_5) 

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 3.591996e-06 .
## The Hessian was positive definite.
## Model rank =  36 / 36 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value
## te(DOY,distance_upstream_km) 35.0 21.7    0.96    0.19
# visualize 2d/3d
vis.gam(fit_5,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "contour",
        color = "topo",
        too.far = 0.1)

vis.gam(fit_5,
        view = c("DOY", "distance_upstream_km"),
        plot.type = "persp",
        theta = 30, phi = 30,
        ticktype = "detailed")

# apply to validation
valid_5$pred <- predict(fit_5, newdata = valid_5)

# and report performance
mae(valid_5$toc_doc, valid_5$pred)
## [1] 0.5304314
rmse(valid_5$toc_doc, valid_5$pred)
## [1] 0.6753088
bias(valid_5$toc_doc, valid_5$pred)
## [1] -0.3052135

Apply ensemble to test set

test_2025 <- mainstem %>% filter(year == 2025)

# apply ensemble
test_2025 <- test_2025 %>%
  mutate(
    pred1 = predict(fit_1, newdata = test_2025),
    pred2 = predict(fit_2, newdata = test_2025),
    pred3 = predict(fit_3, newdata = test_2025),
    pred4 = predict(fit_4, newdata = test_2025),
    pred5 = predict(fit_5, newdata = test_2025),
    pred_avg = (pred1 + pred2 + pred3 + pred4 + pred5) / 5
  )

# calculate performance metrics
mae_val  <- mae(test_2025$toc_doc, test_2025$pred_avg)
rmse_val <- rmse(test_2025$toc_doc, test_2025$pred_avg)
bias_val <- bias(test_2025$toc_doc, test_2025$pred_avg)

# r2 from linear regression
r2_val <- summary(lm(test_2025$toc_doc ~ test_2025$pred_avg))$r.squared

# format annotation text for plot
metrics_text <- paste0(
  "MAE = ", round(mae_val, 2),
  "\nRMSE = ", round(rmse_val, 2),
  "\nBias = ", round(bias_val, 2),
  "\nR² = ", round(r2_val, 2)
)


# pred - obs scatter plot!
ggplot(test_2025, aes(x = toc_doc, y = pred_avg)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Observed TOC (mg/L)",
       y = "Predicted TOC (ensemble, mg/L)",
       title = "GAM Ensemble Predictions vs Observed (2025)") +
  annotate("text", x = 3, y = 8, 
           hjust = 0, vjust = 1, 
           label = metrics_text, size = 4) +
  theme_bw(base_size = 14) +
  coord_cartesian(xlim = c(3, 8), ylim = c(3, 8))

# for kicks, look at performance by site
ggplot(test_2025, aes(x = toc_doc, y = pred_avg)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Observed TOC (mg/L)",
       y = "Predicted TOC (ensemble, mg/L)",
       title = "GAM Ensemble Predictions vs Observed (2025)") +
  theme_bw(base_size = 14) +
  coord_cartesian(xlim = c(3, 8), ylim = c(3, 8)) +
  facet_wrap(Site ~ .)

Visualize the decision space for interpretability

GAMs are fairly interpretable when they have few interacting variables. In this case, we can create a synthetic space-time grid and apply the model across it to visualize what happens up the canyon over time.

# Make a prediction grid across DOY and distance
newdata <- expand.grid(
  DOY = seq(min(mainstem$DOY), max(mainstem$DOY)),
  distance_upstream_km = seq(min(mainstem$distance_upstream_km),
                             max(mainstem$distance_upstream_km))
)

# Get predictions from your GAMs for the grid
newdata$fit1 <- predict(fit_1, newdata = newdata)
newdata$fit2 <- predict(fit_2, newdata = newdata)
newdata$fit3 <- predict(fit_3, newdata = newdata)
newdata$fit4 <- predict(fit_4, newdata = newdata)
newdata$fit5 <- predict(fit_5, newdata = newdata)

# create ensemble prediction
newdata <- newdata %>% 
  mutate(ave_fit = (fit1 + fit2 + fit3 + fit4 + fit5) / 5)

# heatmap + contour plot
ggplot(newdata, aes(x = DOY, y = distance_upstream_km, fill = ave_fit)) +
  geom_tile() +
  geom_contour(aes(z = ave_fit), color = "white", alpha = 0.7) +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    fill = "Predicted TOC",
    x = "Day of Year",
    y = "Distance upstream from Canyon Mouth (km)"
  ) +
  theme_minimal(base_size = 14) 

And let’s save these models as the baseline:

# write_rds(fit_1, here("data/upper_clp_dss/modeling/TOC_baseline/TOC_GAM_fit1_v2025-10-01.rds"))
# write_rds(fit_2, here("data/upper_clp_dss/modeling/TOC_baseline/TOC_GAM_fit2_v2025-10-01.rds"))
# write_rds(fit_3, here("data/upper_clp_dss/modeling/TOC_baseline/TOC_GAM_fit3_v2025-10-01.rds"))
# write_rds(fit_4, here("data/upper_clp_dss/modeling/TOC_baseline/TOC_GAM_fit4_v2025-10-01.rds"))
# write_rds(fit_5, here("data/upper_clp_dss/modeling/TOC_baseline/TOC_GAM_fit5_v2025-10-01.rds"))